Neotoma
|
This workshop is designed to use paleoecological data and associated internet resources to provide participants with the following:
Guidance on best practices in the archiving and analysis of paleovertebrate data
Training in the use of the Neotoma Paleoecology Database (www.neotomadb.org) to archive, access, and analyze paleoecological data.
Neotoma is a multiproxy paleo-database that stores a range of paleoecological & paleoenvironmental data, including vertebrate faunal data. One of the strengths of Neotoma is the ability to compare faunal data with other proxy data such as fossil pollen, diatoms, ostracodes, insects, charcoal, and geochemical data. In addition, the database is structured to relate absolute dates to taxon occurrences and to allow the creation and storage of age models built on absolute dates from stratigraphic sections. Neotoma is a public-access, community-supported database that is emerging as the standard repository for Pliocene and Quaternary paleoecological data.
More teaching materials can be found in Neotoma’s educational resources.
QUESTION 1: What is the latitude and longitude of New Paris #4? What is the Site ID?
QUESTION 2: How many total sites are found in that region?
QUESTION 3: In which states has Ernie worked?
QUESTION 4: How many sites have Antilocapra records?
QUESTION 5: What state has the easternmost location of Antilocapra in this search? (For comparison, the eastern range limit of Antilocapra today is in western Iowa).
QUESTION 6: The generation of fossil records is labor intensive and hence expensive - e.g. the costs of fieldwork, the money spent on radiocarbon dates, the time required for a trained analysis to identify specimens, etc. A rough time/cost estimate for a single vertebrate fossil record is on the order of two years and $30,000. Given this, give an order-of-magnitude estimate of the number of person-years and dollars it took to generate these fossil vertebrate records now stored in Neotoma. (Order-of-magnitude = 10 person-years? 100 person-years? 1000? etc.)
QUESTION 7: Does mastodon tend to live in places with spruce, or without spruce? Suggest two hypotheses that might explain the observed association (or lack of one).
QUESTION 8: Describe the history of Mammuthus distributions in North America over the last 21,000 years.
QUESTION 9: What other information would you want about this date to determine authenticity?
QUESTION 11: What is the difference between information in the “Chronology” tab of the Vertebrate dataset versus the information within the Geochronology dataset?
QUESTION 12: When was the last time Aplodontia rufa was seen at the site?
? are processed as parameters to be passed to the server. Most APIs use calls to different URLs to return different kinds of data or the specific data that meets your query. We can think of http://google.com/?q=bananas as if it were an R function called search(), passed as search(q="bananas").
Many public APIs exist. For example, Google has APIs for most of its products. You could write an R (or Python, etc.) script that would connect to the Google Calendar API to allow you to automatically change events on your calendar or report agenda items back to you in your own custom environment. Hilary Parker has a nice example for adding sunsets to Google Calendar using R.
APIs are an important part of many important biological and paleobiological databases, because they allow users to filter, search, and download records using custom scripts. Different APIs have different parameter names, structures, and can produce different types of returns, including raw text, comma-delimited test, and JSON (JavaScript Object Notation).
We will work through examples from the Paleobiology Database and Neotoma.
Goal: Get all occurrences of Camelidae in the Paleobiology database that are dated to the Pliestocene, returning only the spatial location of the occurrence. Get the response as comma-separated text (CSV format).
https://paleobiodb.org/data1.2/occs/list.txt.?, followed by the parameter key-value pairs. base_name=Camelidae& searches for a taxonomic name at any level of the Linnaean hierarchy and returns any occurrences with that name, its synonyms, or its subtaxa. Try changing to a name of interest to you.& between key-value pairs. &interval=Pleistocene returns occurrences that fall within the specific geologic interval. Intervals can include North American Land Mammal Ages as well as the conventional Geologic Timescale. Try “Blancan” and “Burdigalian” to see what happens.show parameter. show=loc limits the data of the response to only include the location of the occurrences.You should end up with a URI string like this, which can be typed into your web browser:
https://paleobiodb.org/data1.2/occs/list.txt?base_name=Camelidae&interval=Pleistocene&show=loc
There are many useful parameters to facilitate searching occurreneces in the Paleobiology database, including:
lngmin and lngmax as well as latmin and latmax)The list of potential parameters is here: https://paleobiodb.org/data1.2/occs/list_doc.html.
Try for a few minutes to explore the data with different calls, using the call list.
The full documentation for the Paelobiology Database API is located at https://paleobiodb.org/data1.2/
Text returns are easily read by humans, but are often not suitable for the nested data structure saved in relational databases. JavaScript Object Notation (JSON) is a widely used data format that can encode these hierarchical relationships for sending across the Web, is easy to read, and can be consumed by many scripting languages (javascript, R, python, etc). JSON is a newer alternative to XML (eXtensible Markup Language), which you may have seen before. JSON is becoming the standard for data transfer in web services and R has several packages for dealing with JSON-formatted data.
JSON is composed of key-value pairs, enclosed by curly brackets and separated by commas (objects). You may also present an array, or an ordered collection of values, enclosed in square brackets. Values may be their own objects, allowing nested relationships between data.
As an example, you could present an occurrence like this example from the Neotoma API:
{
"SiteLongitudeWest": -103.31666666666666,
"SiteLatitudeSouth": 34.283333333333339,
"TaxonName": "Smilodon fatalis",
"VariableElement": "bone/tooth",
"Value": 1.0,
"VariableContext": null,
"TaxaGroup": "MAM",
"SampleAgeYounger": 15332.0,
"SampleAgeOlder": 30041.0,
"SiteLongitudeEast": -103.31666666666666,
"SiteAltitude": 1280.0,
"VariableUnits": "present/absent",
"DatasetID": 4564,
"SampleAge": null,
"SiteLatitudeNorth": 34.283333333333339
}
Can you tell what kind of occurrence this JSON object is describing? How old is it? Where is it located?
Goal: Search Neotoma for all Smilodon occurrences, and put them into an R data frame.
http://api.neotomadb.org/v1/data/sampledata?taxonname=Smilodon*
http://api.neotomadb.org/v1/data/sampledata?taxonname=Smilodon*&format=pretty
You should be able to see the nested set of JSON objects, including the occurrences returned as comma-separated objects within "data".
Try experimenting with the search, substituting different names. We’ll do more with the Neotoma API in a bit.
JSON is becoming the standard for data transfer in web services. R has several packages for dealing with JSON-formatted data. We will use some examples from the package RJSONIO. We will also use the package RCurl, which has functions to let you query APIs from within the R environment. The neotoma package is built around the newer httr package.
# Uncomment this line if you haven't already installed any of these packages:
# install.packages(c("RCurl", "RJSONIO"))
# Add the packages to your programming environment
library(RCurl)
library(RJSONIO)
You will also want to make sure you have the latest version of R installed, or else the secure connection (https) won’t work in the following queries.
Now create a query for the PaleoBioDB API using the example from above:
q <- "https://paleobiodb.org/data1.2/occs/list.txt?base_name=Camelidae&interval=Pleistocene&show=loc,class"
Create an object to receive the results:
a <- basicTextGatherer()
And execute that query:
curlPerform(url = q, writefunction = a$update)
Finally, view the data:
a$value()
You can see that the data have come in as a character vector; a long list of text strings with no clear structure. Luckily, the PaleoBioDB API also has a JSON interface.
Change your query to refer to list.json:
q <- "https://paleobiodb.org/data1.2/occs/list.json?base_name=Camelidae&interval=Pleistocene&show=loc,class"
Rerun your query and look at a$value( ) again. How does it look now?
a <- basicTextGatherer()
curlPerform(url = q, writefunction = a$update)
a$value()
You can convert it to a data frame:
tmp <- fromJSON(a$value())
records <- tmp$records
results <- data.frame(records[1], stringsAsFactors = FALSE)
for (x in records[-1]) {
x <- data.frame(x, stringsAsFactors = FALSE)
results <- merge(results, x, by = intersect(names(results),names(x)), all = TRUE)
}
dim(results) # this shows the dimensions of the data frame
head(results) # using head() here only prints the top 6 rows, out of 424
When you have gotten this code to work, go through and make comments to describe what each section is doing. Remember, you can make comments in your R code by placing a hashtag (#) at the beginning of a line, or after a line has run (as with the last few lines in the above chunk of code).
Try playing with your queries to see what you can pull from the PaleoBioDB, or even extending to the other two APIs we have explored.
In the previous example, you can see that the example API call is reporting data from only part of the distributed database schema of Neotoma. In fact, the Neotoma API is designed around a set of different URLs, each of which allows a user to search a portion of the database. So, if you want the site information for occurrences of Smilodon, you would have to search on Smilodon in the SampleData URI (as we did in the example), pull the DatasetID values from those returns, then search on the Dataset ID (http://api.neotomadb.org/v1/data/datasets) for those DatasetIDs, which would, in turn, produce the SiteIDs, which you would then search on the Sites URI (http://api.neotomadb.org/v1/data/sites). This sort of searching would be cumbersome if you were to do it by hand, but fortunately you can script a computer to do it for you. In fact, you don’t have to write the scripts to do it, because they have already been constructed and provided to the community as the R neotoma package.
Other APIs also have wrapper packages to simplify data calls in R. The PaleoBioDB has a package, paleobioDB, and iDigBio has a package, ridigbio. Currently, a large group of collaborators is working on a single API and wrapping R package to access both PaleoBioDB and Neotoma at the same time, as well as linking to many online museum databases and iDigBio. While these packages are a great convenience and the neotoma package is the subject of our next module, we wanted to introduce you to the underlying architecture here so you would understand what these packages are doing. If they cannot give you the data or fomat you need for your work, they can be cracked open and a custom solution can be easily put together.
neotoma PackageInstall the neotoma package, then add it to your programming environment
# Uncomment this line if you haven't already installed any of these packages:
# install.packages(c("neotoma"))
# Note, because the `neotoma` package is under active development there may be very recent changes
# you may be interested in. To get those, uncomment the following lines:
# install.packages('devtools')
# devtools::install_github('ropensci/neotoma')
library(neotoma)
neotoma has three core commands: get_site, get_dataset, and get_download. The first two return metadata for sites and datasets; the latter returns data. See Goring et al. (Goring et al. 2015) for a full description of the package and example code. This exercise is partially based on those examples.
We’ll start with get_site. get_site returns a data.frame with metadata about sites. You can use this to find the spatial coverage of data in a region (using get_site with a bounding box), or to get explicit site information easily from more complex data objects. Use the command ?get_site to see all the options available.
You can easily search by site name, for example, finding “Samwell Cave”.
samwell_site <- get_site(sitename = 'Samwell%')
Examine the results
print(samwell_site)
## site.name long lat elev
## Samwell Cave -122.2379 40.91691 465
## A site object containing 1 sites and 8 parameters.
While samwell_site is a data.frame it also has class site, that’s why the print output looks a little different than a standard data.frame. That also allows you to use some of the other neotoma functions more easily.
By default the search string is explicit, but because older sites, especially pollen sites entered as part of COHMAP, often had appended textual information (for example (CA:British Columbia)), it’s often good practice to first search using a wildcard character. For example, searching for “Marion” returns three sites:
marion_site <- get_site(sitename = 'Marion%')
print(marion_site)
## site.name long lat elev
## Marion Lake (CA:British Columbia) -122.54722 49.30833 305
## Marion Landfill -83.18611 40.59167 NA
## Marion Lake -121.86241 44.55770 1259
## A site object containing 3 sites and 8 parameters.
You can also search by lat/lon bounding box. This one roughly corresponds to Florida.
FL_sites <- get_site(loc = c(-88, -79, 25, 30))
You can also search by geopolitical name or geopolitical IDs (gpid) stored in Neotoma. For a list of names and gpids, go to http://api.neotomadb.org/apdx/geopol.htm, or use the get_table(table.name = "GeoPoliticalUnits") command. This command works either with an explicit numeric ID, or with a text string:
# get all sites in New Mexico (gpid=7956)
NM_sites <- get_site(gpid = 7956)
# get all sites in Wisconsin
WI_sites <- get_site(gpid = "Wisconsin")
data.frame stores vectors of equal length. The nice thing about a data.frame is that each vector can be of a different type (character, numeric values, etc.). In RStudio, you can use the Environment panel in upper right to explore variables.
We pointed out before that the object returned from get_site is both a data.frame and a site object. Because it has a special print method some of the information from the full object is obscured when printed. You can see all the data in the data.frame using str (short for structure):
str(samwell_site)
Let’s look at the description field:
samwell_site$description
marion_site$description
The structure of the Neotoma data model, as expressed through the API is roughly: “counts within download, download within dataset, dataset within site”. So a dataset contains more information than a site, about a particular dataset from that site. A site may have a single associated dataset, or multiple. For example:
samwell_datasets <- get_dataset(samwell_site)
print(samwell_datasets)
get_dataset returns a list of datasets containing the metadata for each dataset
We can pass output from get_site to get_dataset, even if get_site returns multiple sites
marion.meta.dataset <- get_dataset(marion_site)
Let’s look at the metadata returned for the Marion% search:
marion.meta.dataset
Both Marion Lake (CA: British Columbia) and Marion Landfill have a geochronology dataset, while Marion Lake (CA: British Columbia) has a pollen dataset and Marion Landfill a vertebrate fauna dataset. The third site, Marion Lake, has a diatom dataset and a water chemistry dataset.
Question 13: Are Marion Lake (CA: British Columbia) and Marion Lake the same site? Try using the browse function.
Further searches (i.e., to examine the lat/longs) or consulting the literature would be required. This illustrates the caution needed when using this, or any other, database. Human judgement still needs to be exercised when examining results from databases. The get_publication() function can return citation information if neccessary, but often using the browse() function is very helpful in examining a database further.
get_download() returns a list that stores a list of download objects - one for each retrieved dataset. Note that get_download() returns the actual data associated with each dataset, rather than a list of the available datasets, as in get_dataset() above. Each download object contains a suite of data for the samples in that dataset.
Get data for all datasets at Samwell Cave. get_download() will accept an object of class dataset (ie, samwell_dataset), but also of class site, since it will automatically query for the datasets associated in each site. Compare the returned messages for the following two commands:
samwell_all <- get_download(samwell_site)
## Warning in get_download.default(datasetid, verbose = verbose): Some datasetids returned empty downloads, be aware that length(datasetid) is longer than the download_list.
samwell_all <- get_download(samwell_datasets)
## Warning in get_download.default(datasetid, verbose = verbose): Some datasetids returned empty downloads, be aware that length(datasetid) is longer than the download_list.
print(samwell_all)
There are a number of messages that appear. These should be suppressed with the flag verbose = FALSE in the function call. One thing you’ll note is that not all of the datasets can be downloaded directly to a download objct. This is because geochronologic datasets have a different data structure than other data, requiring different fields, and as such, they can be obtained using the get_geochron() function:
samwell_geochron <- get_geochron(samwell_site)
print(samwell_geochron)
The result is effectively the inverse of the first.
Get the vertebrate datasets for just Samwell Cave Popcorn Dome (dataset 14262):
samwell_pd <- get_download(14262)
## API call was successful. Returned record for Samwell Cave
Let’s examine the available data in this download
str(samwell_pd[[1]])
There are 6 associated fields:
Within the download object, sample.meta stores the core depth and age information for that dataset. We just want to look at the first few lines, so are using the head() function. Let’s explore different facets of the dataset
head(samwell_pd[[1]]$sample.meta)
#taxon.list stores a list of taxa found in the dataset
head(samwell_pd[[1]]$taxon.list)
#counts stores the the counts, presence/absence data, or percentage data for each taxon for each sample
head(samwell_pd[[1]]$counts)
If we have time, we can work through the example given the Neotoma package paper in Open Quaternary.
Jump to the section “Examples”, then scroll down to “Mammal Distributions in the Pleistocene”. The R code is reproduced below, with some modifications to accomodate changes since publication.
# install.packages('ggplot2','reshape2')
library("ggplot2")
library("reshape2")
# Bounding box is effectively the continental USA, excluding Alaska
mam.set <- get_dataset(datasettype= 'vertebrate fauna', loc = c(-125, 30, -115, 49.5))
# Retrieving this many sites can be very time consuming.
# If you are running code that takes a long time, but has an obvious output, you can stick it
# in an output file and call it later. This is a nice trick :)
# if ("mam_dl.rds" %in% list.files('objects')) {
# mam.dl <- readRDS('objects/mam_dl.rds')
# } else {
# mam.dl <- get_download(mam.set)
# saveRDS(mam.dl, file = 'objects/mam_dl.rds')
# }
githubURL <- "https://github.com/NeotomaDB/Workshops/blob/master/SVP2016/R/objects/mam.dl.RData?raw=true"
load(url(githubURL))
compiled.mam <- compile_downloads(mam.dl)
time.bins <- c(500, 4000, 10000, 15000, 20000)
mean.age <- rowMeans(compiled.mam[,c('age.old','age.young', 'age')], na.rm = TRUE)
interval <- findInterval(mean.age, time.bins)
periods <- c('Modern',
'Late Holocene',
'Early-Mid Holocene',
'Late Glacial',
'Full Glacial',
'Late Pleistocene')
compiled.mam$ageInterval <- periods[interval + 1]
mam.melt <- melt(compiled.mam, measure.vars = 11:(ncol(compiled.mam)-1), na.rm = TRUE, factorsAsStrings = TRUE)
mam.melt <- transform(mam.melt, ageInterval = factor(ageInterval, levels = periods))
mam.lat <- dcast(data = mam.melt, variable ~ageInterval, value.var = 'lat',
fun.aggregate = mean, drop = TRUE)[,c(1, 3, 5, 6)] # variable Late Holocene Late Glacial Full Glacial
# We only want taxa that appear at all time periods:
mam.lat <- mam.lat[rowSums(is.na(mam.lat)) == 0, ]
# Group the samples based on the range &direction (N vs S) of migration.
# A shift of only 1 degree is considered stationary.
mam.lat$grouping <- factor(findInterval (mam.lat[,2] - mam.lat[, 4], c(-11, -1, 1, 20)),
labels = c('Southward','Stationary','Northward'))
mam.lat.melt <- melt(mam.lat)
colnames(mam.lat.melt)[2:3] <- c('cluster','Era')
ggplot(mam.lat.melt, aes(x = Era, y = value)) +
geom_path(aes(group = variable, color =cluster)) +
facet_wrap(~ cluster) +
scale_x_discrete(expand = c(.1,0)) +
ylab('Mean Latitude of Occurrence') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
neotoma R package: https://github.com/ropensci/neotoma (also accessible via the NeotomaDB github)Goring, S., A. Dawson, G. L. Simpson, K. Ram, R. W. Graham, E. C. Grimm, and J. W. Williams. 2015. neotoma: A programmatic interface to the neotoma paleoecological database. Open Quaternary 1:2.
| Funded by:
National Science Foundation Directorate for Geosciences Division of Earth Sciences |
EarthCube Integrative Activities National Sciences Foundation |